ABSTRACT

Phylogenetic trees representing the evolutionary 
relationships of homologous genes are the entry 
point for many evolutionary analyses. For instance, 
the use of a phylogenetic tree can aid in the infer- 
ence of orthology and paralogy relationships, and in 
the detection of relevant evolutionary events such 
as gene family expansions and contractions, hori- 
zontal gene transfer, recombination or incomplete 
lineage sorting. Similarly, given the plurality of evo- 
lutionary histories among genes encoded in a given 
genome, there is a need for the combined analysis 
of genome-wide collections of phylogenetic trees 
(phylomes). Here, we introduce a new release of 
PhylomeDB (http://phylomedb.org), a public reposi- 
tory of phylomes. Currently, PhylomeDB hosts 120 
public phylomes, comprising >1.5 million maximum 
likelihood trees and multiple sequence alignments. 
In the current release, phylogenetic trees are 
annotated with taxonomic, protein-domain arrange- 
ment, functional and evolutionary information. 
PhylomeDB is also a major source for phylogeny- 
based predictions of orthology and paralogy, 
covering >10 million proteins across 1059 
sequenced species. Here we describe newly imple- 
mented PhylomeDB features, and discuss a bench- 
mark of the orthology predictions provided by the 
database, the impact of proteome updates and the 
use of the phylome approach in the analysis of 
newly sequenced genomes and transcriptomes. 

INTRODUCTION 

Phylogenomics, the study of genomes from an evolutionary
perspective (1), offers an ideal framework for extracting
relevant biological knowledge from the continuously
growing amount of available genome sequence data. For
instance, the origin and evolution of relevant phenotypic 
features of a given group of organisms should ultimately 
be related to underlying genome changes, and these can be 
revealed by particular evolutionary patterns in the 
relevant gene families. Given the plurality of evolutionary 
histories among genes encoded in a given genome (2,3), 
many such approaches involve the reconstruction and 
analysis of large collections of phylogenetic trees. In 
addition, considering the broad range of available 
methods and the expertise needed to define an appropriate 
state-of-the-art phylogenetic pipeline in a fully automated
way (4), many biologists benefit from the availability of
pre-computed phylogenies for the genes and genomes of
their interest. PhylomeDB was created in 2006 as a reposi- 
tory of complete collections of evolutionary histories of 
genes encoded in a given genome (i.e. the phylome) 
(5,6). It provides alignments and trees enriched with 
relevant annotations, as well as prediction of orthology 
and paralogy relationships, all of which can be searched, 
downloaded or visualized interactively. PhylomeDB is 
unique among other phylogenetic repositories (7-11), in 
that it follows an approach that is both gene-centric and 
genome-wide. In brief [for a detailed description see (6)], 
for each protein-coding gene (the seed gene) in a given 
genome (the seed genome), the PhylomeDB pipeline 
recapitulates the steps that a phylogeneticist would take to
reconstruct the evolution of a given gene. These steps
include finding homologs in a given set of target species,
which define the taxonomic scope of the phylome, aligning
their sequences, filtering poorly aligned regions, selecting
the most appropriate evolutionary model and building a
phylogenetic tree. Each step is performed using
state-of-the-art methodologies and programs. For example,
alignments are generated using a combination of three different
alignment programs run over the sequences in a forward
and reverse orientation, i.e. the heads or tails approach
(12). The information from these six different alignments
is used not only to create a consistency-based consensus
alignment (13) but also to inform a subsequent filtering of
alignment columns containing residue pairs observed in
just one underlying alignment, as implemented in trimAl
version 1.4 (14). This procedure is applied sequentially to
every gene in the seed genome, ensuring maximum 
coverage. In addition, this gene-centric approach circum- 
vents the problems associated with defining gene families. 
Gene families are inherently hierarchical in nature, diver- 
sifying in complex ways due to events of gene duplication 
and loss (15). However, current approaches define families 
by clustering a network of pairwise relations to identify 
densely connected sub-networks that cannot represent the 
actual hierarchy present in the data (16). A gene-centric 
approach overcomes this step and results in a comprehen- 
sive collection of evolutionary histories, each one taken 
from the perspective of a single gene. An additional 
advantage of a gene-centric approach is the partial redun- 
dancy contained in the collection, with many evolutionary 
events captured in several trees, built from paralogous 
genes. This enables the use of consistency-based 
approaches in downstream evolutionary analyses, such 
as the detection of duplications (17), and the inference 
of orthology and paralogy relationships (18). Here we 
describe the main new features of PhylomeDB version 4 
and discuss some recent analyses. 
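
The column filtering idea mentioned above (discarding alignment columns whose residue pairs are seen in only one of the underlying alignments) can be illustrated with a minimal Python sketch. This is not the trimAl implementation: the function names (column_pairs, filter_columns), the min_support parameter and the toy alignments are assumptions introduced here for illustration only, and the sketch ignores the consensus-building step.

```python
from collections import Counter
from itertools import combinations


def column_pairs(alignment):
    """Return, for each column, the set of residue pairs aligned in it.

    alignment: dict mapping sequence name -> gapped string (equal lengths).
    A residue is identified by (sequence name, index in the ungapped sequence).
    """
    names = sorted(alignment)
    length = len(alignment[names[0]])
    position = {name: 0 for name in names}
    columns = []
    for col in range(length):
        residues = []
        for name in names:
            if alignment[name][col] != "-":
                residues.append((name, position[name]))
                position[name] += 1
        columns.append(set(combinations(residues, 2)))
    return columns


def filter_columns(reference, alignments, min_support=2):
    """Indices of reference columns whose pairs all occur in at least
    min_support of the given alignments (hypothetical simplified rule)."""
    support = Counter()
    for aln in alignments:
        for pairs in column_pairs(aln):
            support.update(pairs)
    kept = []
    for col, pairs in enumerate(column_pairs(reference)):
        # Columns holding a single residue have no pairs and are kept.
        if all(support[pair] >= min_support for pair in pairs):
            kept.append(col)
    return kept


# Toy example: three alternative alignments of the same two sequences.
aln1 = {"a": "AC-G", "b": "ACTG"}
aln2 = {"a": "AC-G", "b": "ACTG"}
aln3 = {"a": "ACG-", "b": "ACTG"}
print(filter_columns(aln3, [aln1, aln2, aln3]))
```

In the toy data, the third column of aln3 pairs the last residue of "a" with a residue of "b" that no other alignment agrees on, so that column is the one dropped.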



AN EXPANDING PHYLOME REPOSITORY AND 
IMPROVED DATA ORGANIZATION 

PhylomeDB is currently the largest repository of pre-computed
phylogenies and provides evolutionary computations
for >10 million proteins in ~1000 species. The
current release has grown significantly, with
103 additional public phylomes, a roughly
7-fold increase. In all, 42 of these new phylomes reflect
the commitment of PhylomeDB to provide substantial
coverage of the reference proteomes from the Quest for
Orthologs initiative (19,20), whereas others are the result
of large-scale analyses that have been part of scientific
studies or of collaborations with genome-annotation
projects. Given the large number of new phylomes
stored in PhylomeDB, a set of collections has been
created to group related phylomes. Collections can unite
phylomes that use related organisms as seeds
(e.g. Plants, Fungi, Vertebrates or Bacteria collections)
or those associated with a given subject or data set
(e.g. Model species, Quest for Orthologs reference proteomes).
Collections serve to limit the scope of tree searches
and to facilitate access to the relevant data for a variety of
user communities, and can be accessed from a section one
click away from the entry page
(http://phylomedb.org/collections), which provides relevant
descriptions. In addition, a new phylome search panel has been
created that allows filtering phylomes by their species content
and selecting several of them for subsequent tree
searches. Tree searches can also be manually limited to a
particular set of phylomes through the 'phyid' parameter
implemented in our URL query system (Table 1).
Finally, PhylomeDB version 4 provides coherent sets
of orthology and paralogy predictions using the most
up-to-date release of consistency-based predictions from
the MetaPhOrs database (18).



MEETING NEW CHALLENGES: 
TRANSCRIPTOME-BASED PHYLOMES 

The PhylomeDB approach has proven useful in the
annotation and analysis of newly sequenced genomes
(21-25). Including a phylogenomic approach in the
genome annotation pipeline serves not only to produce a
comprehensive catalog of orthology and paralogy relationships
of the newly sequenced species and their relatives
of interest but also many other purposes, including
the reconstruction of the species
phylogeny and the detection of gene family expansions
and contractions that may relate to the emergence of
particular phenotypes. In recent years, massive transcriptome
sequencing and assembly have been increasingly used as an
alternative to the sequencing of whole genomes. This has
been shown to be a cost-effective approach to address 
many functional and evolutionary questions about an 
organism. We have tested our pipeline and procedures 
in three transcriptome sets for early dipterans (26). 
Compared with high-coverage genomes, transcriptomes 
generally have more missing, incomplete and fragmented 
genes, a scenario that is similar to that of low-coverage 
genomes (27). As a result, large-scale phylogenetic data
sets derived from transcriptomes are noisier, and
downstream analyses have to be interpreted carefully. In
the dipteran study mentioned above, we found that homolog
identification was severely affected by the fragmented
nature of the genes in the seed species, and thus
transcriptome-based phylomes are best analyzed in conjunction
with a phylome generated using a related species as a
complementary seed (e.g. Drosophila in this case). Despite
these caveats, transcriptome-based phylomes
were useful as an efficient way of detecting orthologs
for functional studies and for addressing phylogenetic
relationships. Phylomes including transcriptome data in
PhylomeDB will be tagged specifically so that
users can choose whether to use data containing
transcriptomes.

THE IMPACT OF PROTEOME UPDATES: THE 
HUMAN PHYLOME 6 YEARS LATER 

Unlike other databases, trees in PhylomeDB are not
re-computed in each release. Phylomes are computationally
demanding, and our limited resources mean
that we face the dilemma of using them either to
recompute existing phylomes or to generate new ones.
Newer annotations are generated rarely for most of the
species considered, with the obvious exception of those
from model species and those coming from constantly
updated databases such as Ensembl (28). The question
remains open as to what level of change
renders a phylome obsolete. Certainly, this depends on
the intended use of the given phylome, which varies from
user to user.

Table 1. List of query terms supported by the PhylomeDB web API

URL query term   Value

seqid            Any sequence identifier (e.g. UniProt ID, Ensembl ID). Required.

phyid            A phylome ID (e.g. 102), a list of comma-separated phylome IDs
                 or a collection ID (e.g. PhyC1). By default a tree from the
                 most recent phylome will be selected.

method           The preferred evolutionary model for the target tree.
                 Default: best model.

snodes           A comma-separated list of target nodes, defined as follows:
                 node_feature|search_pattern|fgcolor|bgcolor, where node_feature
                 is one of the text-based node attributes available:
                   name: leaf names as shown in the tips of the tree (e.g. TP53)
                   phylomedb_name: PhylomeDB ID format (e.g. Phy00086SJ)
                   gene_name: original ID used in the source proteome (e.g. ORF_1)
                   swissprot_name: a Swiss-Prot ID (e.g. P04637)
                   trembl_name: a TrEMBL ID (e.g. K7PPA8)
                   ensembl_name: any protein, transcript or gene Ensembl ID
                     (e.g. ENSP00000269305)
                   genolevures_name: an Ascomycete-based Genolevures database ID
                   taxid: an NCBI taxa ID (e.g. 9606)
                   species: UniProt species code (e.g. HUMAN)
                   spname: scientific name (e.g. sapiens)
                   relative_age: any of the tracked NCBI taxa names (e.g. Primates)
                 search_pattern must be a text string or a Perl regular
                 expression. fgcolor and bgcolor are optional parameters
                 controlling foreground and background colors of the matching
                 nodes (color should be one of the SVG color names or an RGB
                 color code).
                 Example:
                 http://beta.phylomedb.org/?q=search_tree&seqid=TP53&snodes=species|MOUSE|red,best_name|TP73|blue|grey,spname|melano,relative_age|primates|blue|steelblue

tree_features    A comma-separated list of tree features to be shown. Currently
                 the following features are supported: best_name, name,
                 gene_name, swissprot_name, trembl_name, ensembl_name,
                 genolevures_name, taxid, spname, lineage, motifs and support.
                 Example:
                 http://beta.phylomedb.org/?q=search_tree&seqid=TP53&tree_features=best_name,ensembl_name,spname,lineage

Only seqid is required to perform a query.

To assess the impact of proteome updates on
automatically computed phylomes, we compared two
different versions of the human phylome across eukaryotes,
the original one (29), published in 2007, and one
re-computed 6 years later, including newer releases of all
proteomes. To our knowledge, this is the first time that
the impact of a proteome update on a large-scale
phylogenetic analysis has been reported. This phylome update,
including many model species and separated by 6 years of
intensive research in the genome annotation field, should
be considered an extreme case. For instance, the Ensembl
protein set for human changed from 32 010 to 21 088
proteins (-32%), of which 13 729 were identical between
both sets, and 4980 included some sort of sequence update
in the newest release. Other model organisms such as
Caenorhabditis elegans displayed larger levels of change
(maintaining only ~50% of nearly identical proteins).
Seven other species, such as Saccharomyces cerevisiae or
Tetraodon nigroviridis, changed only slightly (<1% of the
data set). Overall, most of the species (29 of 39) retained
relatively high levels (>80%) of equivalence between both
sets. Because most of the changes, as shown above for
human, correspond to the removal of predicted proteins,
the total number of trees (19 565 versus 19 621) and the
average number of proteins per tree (66 versus 65) remain
stable, but result in a higher coverage of the query
proteome (61 versus 93%), suggesting that most
removed proteins were predicted gene models without
homologs in other species. The resulting phylogenies
differed by a normalized Robinson and Foulds distance
of 21% different partitions, and 71% of the predicted
orthologous pairs were conserved between both releases.
Analyses further downstream were less affected, such as the
reconstruction of a species tree using either a gene tree
parsimony approach (14% different partitions) or the
concatenation of one-to-one orthologs (8% different
partitions). Changes in the final phylogeny correspond to
variable positions of Guillardia theta and Encephalitozoon
cuniculi. Thus, substantially improved predicted gene sets
can affect downstream analyses to various
degrees, and phylome updates would therefore be
recommended after major releases of the seed proteome or of
several of the other proteomes included. Three of the
oldest phylomes have been re-computed using largely
updated proteomes: the aforementioned human
phylome and those of Schistosoma mansoni (30) and
Acyrthosiphon pisum (22). We will continuously monitor
the need to recompute phylomes based on the
availability of significantly changed proteomes and the general
use of the existing phylome. We encourage users to suggest
updates whenever significant re-annotations are
performed. Deprecated phylomes will still be available for
download.
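
The "percentage of different partitions" figures quoted above are normalized Robinson and Foulds distances. As a rough illustration of how such a value can be obtained, the following Python sketch compares the non-trivial bipartitions of two unrooted topologies defined on the same leaves. The nested-tuple tree representation and the function names are assumptions made for this example; this is not the code used to produce the reported numbers.

```python
def leaves(tree):
    """Set of leaf names under a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions of a tree, each stored as the side
    that contains the alphabetically smallest leaf."""
    all_leaves = leaves(tree)
    anchor = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaves(node)
        # Skip trivial splits (single leaf versus the rest, or all leaves).
        if 1 < len(clade) < len(all_leaves) - 1:
            found.add(clade if anchor in clade else all_leaves - clade)
        for child in node:
            visit(child)

    visit(tree)
    return found


def normalized_rf(tree_a, tree_b):
    """Fraction of bipartitions present in only one of the two trees."""
    if leaves(tree_a) != leaves(tree_b):
        raise ValueError("both trees must contain the same leaves")
    a, b = splits(tree_a), splits(tree_b)
    total = len(a) + len(b)
    return len(a ^ b) / total if total else 0.0


old_tree = ((("A", "B"), "C"), ("D", "E"))
new_tree = ((("A", "C"), "B"), ("D", "E"))
print(normalized_rf(old_tree, new_tree))
```

Here the two toy trees share one of their two non-trivial partitions, so half of the partitions differ.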



BENCHMARKING ORTHOLOGY IN THE QUEST 
FOR ORTHOLOGS INITIATIVE 

The Quest for Orthologs initiative aims at sharing knowledge
among users and developers of orthology prediction
algorithms and databases, as well as establishing standards
in this field (19,20). One of the main achievements
so far has been the development of a common
benchmarking resource, gathering some of the available
tests assessing different properties of predicted
orthologous sets (31,32). Although existing benchmarks
represent a necessarily indirect way of measuring
the accuracy of orthology prediction, and are difficult to
interpret, they nevertheless represent a useful tool for
algorithm developers, who can now observe and understand
the behavior of their algorithms on different data sets,
tests and parameter sets. We took this opportunity to
test a few parameters of our orthology prediction algorithm,
which is based on the concept of detecting
duplications using a species-overlap threshold (16,29). We
investigated the accuracy of pairwise predictions in
relation to the nodal distance to the seed. We found that
drawing predictions of pairwise relationships only for
pairs of sequences including at least one belonging to
the subset of 30 sequences closest to the seed (close2seed)
improved sensitivity (15% for the agreement with reference
phylogeny test), without significantly altering specificity.
This is because for large phylogenies with multiple
paralogs, less reliable signal from collateral trees (i.e.
those in which the sequence is present but not used as a
seed) would overcome the signal from seed trees. For our
data sets, a close2seed value of 30 performed best,
and therefore this has been implemented in the default
orthology prediction algorithm. Overall, orthology
predictions from PhylomeDB achieved a good compromise
between coverage and accuracy in the different
benchmarks (http://orthology.benchmarkservice.org). For
instance, in a benchmark based on the agreement with a
reference tree of eukaryotes (32), PhylomeDB provided
11 872 orthology predictions, and trees reconstructed
from these orthologs differed from the reference species
topology by 6.7% of partitions, on average.

Figure 1. Example of the integrated tree visualization interface showing the gene family phylogeny of TP53. (a) The tree search panel allows
switching among all available trees containing the target sequence, even if it was not used as a seed (i.e. collateral tree). (b) The tree editing
menu allows searching for nodes matching custom criteria, selecting which tree features are shown in the image and downloading the image or other data.
(c) Poorly supported nodes are highlighted with a transparent bubble, and speciation and duplication events are indicated using red and blue colors,
respectively. (d) A taxonomy panel indicating the assignment of different partitions to major taxonomic levels. The taxonomic level associated with each
color is shown on mouse-over events. (e) Domain and sequence panel. PFAM motifs are represented by different shapes and can be clicked for
extended information. Inter-domain coding regions are shown using the standard amino acid color codes. Gap regions are illustrated as a flat line.
(f) Available tree features. One or more attributes can be selected to modify the default aspect of the tree image. (g) The tree legend
indicating color codes of the different tree nodes. (h) The search panel allows searching for nodes matching custom criteria on any of a number of node
attributes. In the example shown, a node containing the P53_C domain has been highlighted through the use of this panel. (i) The contextual node
menu, including extended information about a node and links to external data sources.
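
The species-overlap idea underlying the orthology predictions discussed in this section can be sketched in a few lines of Python. A node is treated as a duplication when the species sets of its two daughter partitions overlap above a threshold, and as a speciation otherwise; sequences branching off at speciation nodes on the path from the seed to the root are called orthologs, and those branching off at duplication nodes paralogs. The nested-tuple trees, the GENE_SPECIES leaf naming, the function names and the zero default threshold are assumptions made for this illustration. The sketch handles only binary trees and omits the close2seed filter and the consistency checks across trees.

```python
def species_of(leaf):
    """Species code of a leaf named GENE_SPECIES (assumed naming scheme)."""
    return leaf.rsplit("_", 1)[1]


def leaf_names(node):
    """All leaf names under a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaf_names(child)]


def is_duplication(left, right, threshold=0.0):
    """True if the species overlap between two partitions exceeds threshold."""
    sp_left = {species_of(name) for name in leaf_names(left)}
    sp_right = {species_of(name) for name in leaf_names(right)}
    overlap = len(sp_left & sp_right) / len(sp_left | sp_right)
    return overlap > threshold


def classify(tree, seed, threshold=0.0):
    """Return (orthologs, paralogs) of the seed in a binary gene tree."""
    if seed not in leaf_names(tree):
        raise ValueError("seed not found in tree")
    orthologs, paralogs = set(), set()
    node = tree
    while not isinstance(node, str):
        left, right = node  # binary trees only
        if seed in leaf_names(left):
            inside, outside = left, right
        else:
            inside, outside = right, left
        target = paralogs if is_duplication(left, right, threshold) else orthologs
        target.update(leaf_names(outside))
        node = inside  # walk down towards the seed
    return orthologs, paralogs


toy_tree = ((("TP53_HUMAN", "TP53_MOUSE"), ("TP63_HUMAN", "TP63_MOUSE")), "P53_DROME")
orthologs, paralogs = classify(toy_tree, "TP53_HUMAN")
print(sorted(orthologs), sorted(paralogs))
```

In the toy tree, the split between the TP53 and TP63 clades shares both species and is therefore called a duplication, so the TP63 sequences come out as paralogs of the human seed.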



NEW DATA AND VISUALIZATION FEATURES 

PhylomeDB version 4 incorporates enhancements in
tree visualization, phylogeny annotation and the tree search
engine. First, the backend of the tree search engine has
been improved to provide a gene-centric view of all
PhylomeDB resources (Figure 1a). Thus, after a protein
or gene search, all the available trees in PhylomeDB are
listed and organized by phylome and tree type. Users can
switch among all available seed and collateral trees
without losing focus on the searched protein or
gene. Users can download relevant data, including the
whole database, a specific phylome or, from the tree
entry page, the data corresponding to that tree.
In this new release, we have implemented the possibility to
download orthology predictions from a tree in the recently
developed OrthoXML standard format (33) (Figure 1b),
in addition to a tabulated format. Second, all the
information available for each tree is now shown using an
integrated layout in which tree topology (Figure 1c),
taxonomic data (Figure 1d), alignments and domain
annotations (Figure 1e) and event-age (phylostratigraphy)
information are rendered in the same figure using the
newest visualization features provided by the ETE
toolkit version 2.2 (34): (i) PFAM domains (35) have
been mapped to each alignment in our database and are
now displayed in a compact panel at the right side of the
tree (Figure 1e). For each sequence, domains and their
names are shown; they can be clicked to obtain a short
description and the external link to PFAM (Figure 1i).
Protein regions not mapped to domains are shown using
the standard amino acid color codes, whereas gap regions
are represented by a flat line. (ii) A taxonomy-information
panel has been added to the right side of every tree,
highlighting the main taxonomic clades present
within each gene tree (Figure 1d). Information on the
estimated relative age (i.e. phylostratigraphy) of each
tree node (17), extended taxonomic information and
functional GO-term annotations (36) is provided by the
contextual menu obtained when clicking on any node.
(iii) Tree images have also been simplified to improve
readability. Mappings and/or cross-linking to general
and organism-oriented databases have been extended to
include the major Arabidopsis thaliana sequence database
TAIR (37), Drosophila's FlyBase (38), the Candida Genome
Database (39) and the Ascomycete-based genome
database Genolevures (40). By default a single sequence
identifier is shown on the tree, prioritizing those that are
better suited for human interpretation, but this can be
adjusted through the tree editing menu (Figure 1f). A
conversion table between PhylomeDB unique identifiers and
other identifiers is provided in the download section.
Speciation and duplication nodes are indicated using
different colors, and branch support values are now
automatically highlighted for poorly supported partitions using
a transparent red bubble inversely proportional to the
branch bootstrap or aLRT value (Figure 1g). Internal
tree searches can be performed for any of the annotated
node attributes (Figure 1h), whereas links to other
databases are provided through the contextual menu of the
tree browser that appears when clicking any node
(Figure 1i), which facilitates functional inference across
members of a gene family. Finally, the web-based
linking API has been improved and now allows
direct links to trees and phylomes, as well as highlighting
custom nodes within a tree topology (Figure 1f). The URL
format used by the API is detailed in Table 1.
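
As an illustration of the URL format summarized in Table 1, the following Python sketch assembles a tree-search link from the documented query terms. The host and the q=search_tree term are copied from the examples in Table 1; the helper name tree_url, its argument layout and the lowercase spelling of the phyid and method terms are assumptions made here. The code only builds a string and does not contact the server, so it does not confirm that the endpoint is still available.

```python
from urllib.parse import urlencode

# Host taken from the example URLs in Table 1 (may no longer be live).
BASE_URL = "http://beta.phylomedb.org/"


def tree_url(seqid, phyids=None, method=None, snodes=None, tree_features=None):
    """Build a tree-search URL (hypothetical helper, not an official client).

    seqid:         any sequence identifier (the only required term)
    phyids:        iterable of phylome IDs or a collection ID
    method:        preferred evolutionary model
    snodes:        iterable of (node_feature, search_pattern[, fgcolor[, bgcolor]])
    tree_features: iterable of feature names to display
    """
    params = {"q": "search_tree", "seqid": seqid}
    if phyids:
        params["phyid"] = ",".join(str(p) for p in phyids)
    if method is not None:
        params["method"] = method
    if snodes:
        params["snodes"] = ",".join("|".join(spec) for spec in snodes)
    if tree_features:
        params["tree_features"] = ",".join(tree_features)
    # Keep '|' and ',' unescaped so the result matches the Table 1 examples.
    return BASE_URL + "?" + urlencode(params, safe="|,")


print(tree_url("TP53", snodes=[("species", "MOUSE", "red")],
               tree_features=["best_name", "spname", "lineage"]))
```

Passing the separators through unescaped keeps the generated link identical in shape to the examples in Table 1; whether the server also accepts percent-encoded separators is not stated in the text.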